chemical language model
Chem42: a Family of chemical Language Models for Target-aware Ligand Generation
Singh, Aahan, Tekin, Engin, Nadeem, Maryam, ElNaker, Nancy A., Sayeed, Mohammad Amaan, Vassilieva, Natalia, Amor, Boulbaba Ben
Revolutionizing drug discovery demands more than just understanding molecular interactions; it requires generative models that can design novel ligands tailored to specific biological targets. While chemical Language Models (cLMs) have made strides in learning molecular properties, most fail to incorporate target-specific insights, restricting their ability to drive de novo ligand generation. Chem42, a cutting-edge family of generative chemical Language Models, is designed to bridge this gap. By integrating atomic-level interactions with multimodal inputs from Prot42, a complementary protein Language Model, Chem42 achieves a sophisticated cross-modal representation of molecular structures, interactions, and binding patterns. This framework enables the creation of structurally valid, synthetically accessible ligands with enhanced target specificity. Evaluations across diverse protein targets confirm that Chem42 surpasses existing approaches in chemical validity, target-aware design, and predicted binding affinity. By reducing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline, offering a powerful generative AI tool for precision medicine. Our Chem42 models set a new benchmark in molecule property prediction, conditional molecule generation, and target-aware ligand design. The models are publicly available at huggingface.co/inceptionai.
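A minimal sketch of how one might sample ligand SMILES from the released checkpoints, assuming they load through the standard Hugging Face transformers causal-LM interface; the repository name `inceptionai/Chem42` is a hypothetical placeholder, not a confirmed model id.

```python
# Hypothetical usage sketch: sample SMILES from a causal chemical LM via the
# standard transformers API. The model id below is a placeholder; check
# huggingface.co/inceptionai for the actual repository names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inceptionai/Chem42"  # placeholder, not a confirmed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Seed generation with a single carbon atom and sample candidate ligands.
inputs = tokenizer("C", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,
    num_return_sequences=8,
)
print([tokenizer.decode(o, skip_special_tokens=True) for o in outputs])
```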
GP-MoLFormer: A Foundation Model For Molecular Generation
Ross, Jerret, Belgodere, Brian, Hoffman, Samuel C., Chenthamarakshan, Vijil, Mroueh, Youssef, Das, Payel
Transformer-based models trained on large, general-purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion. We also find strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than or comparably with baselines across all three tasks, demonstrating its general utility.
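Since the abstract names rotary positional encodings as part of the base architecture, here is a minimal, generic sketch of that component (a standard formulation, not GP-MoLFormer's own code):

```python
# Minimal sketch of rotary positional encodings (RoPE) applied to one
# attention head's query or key block.
import torch

def rotary(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position encodings to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    # One inverse frequency per channel pair, as in the original RoPE paper.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)       # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    # Rotate each channel pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)   # toy query block: 16 tokens, head dim 64
print(rotary(q).shape)    # torch.Size([16, 64])
```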
Difficulty in chirality recognition for Transformer architectures learning chemical structures from string representations
Yoshikai, Yasuhiro, Mizuno, Tadahaya, Nemoto, Shumpei, Kusuhara, Hiroyuki
Recent years have seen rapid development of descriptor generation based on representation learning of extremely diverse molecules, especially those that apply natural language processing (NLP) models to SMILES, a literal representation of molecular structure. However, little research has been done on how these models understand chemical structure. To address this black box, we investigated the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer. We show that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures. Consistently, the accuracy of molecular property predictions using descriptors generated from models at different learning steps was similar from the beginning to the end of training. Furthermore, we found that the Transformer requires particularly long training to learn chirality and sometimes stagnates with low performance due to misunderstanding of enantiomers. These findings are expected to deepen the understanding of NLP models in chemistry.
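For concreteness, the chirality information at issue lives in the @ and @@ tokens of SMILES; a quick RDKit check (standard API) shows that two strings differing only in that single token denote distinct enantiomers:

```python
# Two SMILES for alanine that differ only in the chirality token @ vs @@.
from rdkit import Chem

m1 = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")  # one alanine enantiomer
m2 = Chem.MolFromSmiles("N[C@H](C)C(=O)O")   # its mirror image

# The canonical isomeric SMILES differ only in @/@@, so a string-based model
# must learn that this one-token difference is chemically meaningful.
print(Chem.MolToSmiles(m1))
print(Chem.MolToSmiles(m2))
```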
Atom-by-atom protein generation and beyond with language models
Flam-Shepherd, Daniel, Zhu, Kevin, Aspuru-Guzik, Alán
Protein language models learn powerful representations directly from sequences of amino acids. However, they are constrained to generate proteins with only the set of amino acids represented in their vocabulary. In contrast, chemical language models learn atom-level representations of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level representations of proteins, enabling protein generation unconstrained by the standard genetic code and far beyond it. In doing so, we show that language models can generate entire proteins atom by atom -- effectively learning the multiple hierarchical layers of molecular information that define proteins from their primary sequence to their secondary and tertiary structure. We demonstrate language models are able to explore beyond protein space -- generating proteins with modified sidechains that form unnatural amino acids. Even further, we find that language models can explore chemical space and protein space simultaneously and generate novel examples of protein-drug conjugates. The results demonstrate the potential for biomolecular design at the atom level using language models.
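As a concrete illustration of the atom-level view the paper builds on, RDKit (standard API) can expand a residue-level sequence into its full molecular graph, whose SMILES an atom-level language model can then be trained on:

```python
# Expand an amino-acid sequence into an explicit molecular graph with RDKit.
from rdkit import Chem

peptide = Chem.MolFromSequence("ACDE")   # Ala-Cys-Asp-Glu tetrapeptide
print(peptide.GetNumHeavyAtoms())        # atom count of the full graph
print(Chem.MolToSmiles(peptide))         # every atom, bond, and ring explicit
```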
Explainability Techniques for Chemical Language Models
Hödl, Stefan, Robinson, William, Bachrach, Yoram, Huck, Wilhelm, Kachman, Tal
Explainability techniques are crucial for gaining insight into the reasons behind the predictions of deep learning models, but they have not yet been applied to chemical language models. We propose an explainable AI technique that attributes the importance of individual atoms towards the predictions made by these models. Our method backpropagates the relevance information towards the chemical input string and visualizes the importance of individual atoms. We focus on self-attention Transformers operating on molecular string representations and leverage a pretrained encoder for finetuning. We showcase the method by predicting and visualizing solubility in water and organic solvents. We achieve competitive model performance while obtaining interpretable predictions, which we use to inspect the pretrained model.
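The abstract does not spell out the exact relevance-propagation rule, but the general shape of such an atom attribution can be sketched with a gradient-times-input pass over token embeddings; the tiny model below is a stand-in for the pretrained encoder, not the paper's method:

```python
# Gradient x input attribution over token embeddings of a toy string model.
import torch
import torch.nn as nn

class TinySolubilityModel(nn.Module):
    """Placeholder stand-in for a pretrained Transformer property predictor."""
    def __init__(self, vocab=64, d=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab, d)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 1
        )
        self.head = nn.Linear(d, 1)

    def forward(self, emb):
        return self.head(self.encoder(emb).mean(dim=1))

model = TinySolubilityModel()
token_ids = torch.randint(0, 64, (1, 12))   # one tokenized SMILES string
emb = model.embedding(token_ids).detach().requires_grad_(True)
model(emb).sum().backward()
# Relevance per input token = gradient x input, summed over channels,
# ready to be mapped back onto the atoms of the SMILES string.
relevance = (emb.grad * emb).sum(-1).squeeze(0)
print(relevance)
```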
SELFormer: Molecular Representation Learning via SELFIES Language Models
Yüksel, Atakan, Ulusoy, Erva, Ünlü, Atabey, Doğan, Tunca
Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We share SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
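The SMILES-versus-SELFIES contrast is easy to see with the open-source `selfies` package (standard encoder/decoder API), e.g. a round trip through the notation:

```python
# Round-trip a molecule through SELFIES with the `selfies` package
# (pip install selfies).
import selfies as sf

smiles = "c1ccccc1O"          # phenol
tokens = sf.encoder(smiles)   # SELFIES string built from bracketed tokens
print(tokens)
print(sf.decoder(tokens))     # decodes back to a valid SMILES string

# Key property: any string assembled from SELFIES tokens decodes to a valid
# molecule, the robustness SELFormer exploits over raw SMILES.
```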
Leveraging molecular structure and bioactivity with chemical language models for drug design
Generative chemical language models (CLMs) can be used for de novo molecular structure generation. These CLMs learn from the structural information of known molecules to generate new ones. In this paper, we show that "hybrid" CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), we created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction.
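Schematically, the generate-then-refine loop described here can be sketched as below; `generator` and `bioactivity_clf` are hypothetical stand-ins for the generative CLM and the CLM-based classifier, not the authors' actual interfaces:

```python
# Hypothetical sketch of a hybrid CLM pipeline: sample a large primary
# virtual library, then keep only molecules the bioactivity classifier favors.
def design_focused_library(generator, bioactivity_clf, n=100_000, threshold=0.5):
    candidates = [generator.sample() for _ in range(n)]          # primary library
    return [smi for smi in candidates
            if bioactivity_clf.predict_proba(smi) >= threshold]  # refined library
```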
Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks
Segler, Marwin H. S., Kogej, Thierry, Tyrchan, Christian, Waller, Mark P.
In de novo drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target. In this work, we show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model. In order to enrich libraries with molecules active towards a given biological target, we propose to fine-tune the model with small sets of molecules, which are known to be active against that target. Against Staphylococcus aureus, the model reproduced 14% of 6051 hold-out test molecules that medicinal chemists designed, whereas against Plasmodium falciparum (Malaria) it reproduced 28% of 1240 test molecules. When coupled with a scoring function, our model can perform the complete de novo drug design cycle to generate large sets of novel molecules for drug discovery.
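The two-stage recipe (pretrain a recurrent SMILES language model on a general corpus, then fine-tune it on a small set of known actives) can be sketched as follows; the LSTM shape and hyperparameters are assumptions, not the paper's exact settings:

```python
# Sketch of pretraining and target-focused fine-tuning of an LSTM SMILES
# language model in PyTorch.
import torch
import torch.nn as nn

class SmilesRNN(nn.Module):
    def __init__(self, vocab=40, d=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.lstm = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)            # next-token logits at each position

def train(model, batches, lr, steps):
    """Next-token training; `batches` yields LongTensors of token ids."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _, tokens in zip(range(steps), batches):
        logits = model(tokens[:, :-1])                  # predict token t+1 from t
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

# model = SmilesRNN()
# train(model, general_corpus_batches, lr=1e-3, steps=100_000)  # pretrain
# train(model, target_active_batches,  lr=1e-4, steps=1_000)    # fine-tune
```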